## Term_f LoanStatus ClosedDate
## 12: 1613 Current :56576 Min. :2009-09-09
## 36:58510 Completed :19498 1st Qu.:2012-06-19
## 60:24545 Chargedoff : 5326 Median :2013-02-25
## Defaulted : 996 Mean :2012-12-25
## Past Due (1-15 days) : 806 3rd Qu.:2013-09-12
## Past Due (31-60 days): 363 Max. :2014-03-10
## (Other) : 1103 NA's :58848
## LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber
## Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00 1st Qu.: 9.00
## Median : 0.00 Median :13.00
## Mean : 36.52 Mean :14.46
## 3rd Qu.: 0.00 3rd Qu.:19.00
## Max. :1593.00 Max. :41.00
## NA's :78442
## LoanMonthsSinceOrigination LoanOriginationDate LoanOriginationQuarter
## Min. : 0.00 Min. :2009-08-11 Q4 2013:14449
## 1st Qu.: 4.00 1st Qu.:2012-02-27 Q1 2014:12172
## Median :11.00 Median :2013-04-09 Q3 2013: 9180
## Mean :15.94 Mean :2012-11-18 Q2 2013: 7099
## 3rd Qu.:25.00 3rd Qu.:2013-11-05 Q3 2012: 5632
## Max. :55.00 Max. :2014-03-12 Q2 2012: 5061
## (Other):31075
## MonthlyLoanPayment LoanOriginalAmount PercentFunded Investors
## Min. : 0.0 Min. : 1000 Min. :0.7000 Min. : 1.0
## 1st Qu.: 157.6 1st Qu.: 4000 1st Qu.:1.0000 1st Qu.: 1.0
## Median : 252.2 Median : 7500 Median :1.0000 Median : 31.0
## Mean : 292.3 Mean : 9094 Mean :0.9981 Mean : 68.1
## 3rd Qu.: 389.0 3rd Qu.:13750 3rd Qu.:1.0000 3rd Qu.: 97.0
## Max. :2251.5 Max. :35000 Max. :1.0125 Max. :1189.0
##
## listingCategory
## Home Improvement :53084
## Other : 9178
## Business : 6783
## Personal Loan : 5285
## Auto : 2230
## Household Expenses: 1996
## (Other) : 6112
##
## FALSE TRUE
## 8909 75759
## Source: local data frame [6 x 4]
##
## OriginationYear mean_ median_ n
## (chr) (dbl) (dbl) (int)
## 1 2009 4401.833 3000 1807
## 2 2010 4775.103 3900 5564
## 3 2011 6692.021 5500 11228
## 4 2012 7833.842 5000 19553
## 5 2013 10545.300 10000 34344
## 6 2014 11912.220 10000 12172
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 31.0 68.1 97.0 1189.0
##
## 1 2 3 4 5 6 7 8 9 10
## 27439 1200 872 696 615 565 553 614 546 526
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5000 10000 11100 15000 35000
Most of the loans are spent on home improvement. The amount ranges from 1000 to 35000$ ( median = 7500$ ), while most of the loans have an amount of up to 15000$ (where 500$ , 1000$ and 15000$ are most common).
We can also see a first trend that since 2009 the average amount that is borrowed increased from around 4000$ to around 10000$ in 2013. An the trend seems to continue for the beginning of 2014 (we only have data till march 2014).
Most of the loans are in status ‘current’ and ‘completed’ which is what you expect for a working platform. However, the number of loans in status current is very high (more than 60%), especially as we have data of almost 5 years and the average loan runs for 3 years, so there should be a fair amount of loans that reached the completed/defaulted status. If we add time information (reference time is about march 2014), we see that Prosper saw a tremendous increase of loans being funded through its platform in the last year (2013). And as the average loan runs for 3 years this explains why so many loans are in status ‘current’.
By far the most loans have one single investor (27440), but some have up to 1189 investors (the median is 68 investors). It looks like quite a number investors decide to invest in specific loans (providing the full amount) instead of using the tools like QuickInvest provided by the plattform. The dataset does not contain info on how the amount is split among investors. But we can see, that single investors are common among all loan amounts.
## [1] 19498 88
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 39.34 Min. : -0.003 Min. :-589.95
## 1st Qu.: 3000.00 1st Qu.: 322.480 1st Qu.: -75.08
## Median : 4999.26 Median : 744.490 Median : -37.72
## Mean : 6448.79 Mean : 1070.187 Mean : -55.35
## 3rd Qu.: 8500.00 3rd Qu.: 1494.088 3rd Qu.: -17.07
## Max. :35000.00 Max. :10013.570 Max. : 2.87
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-1115.630 Min. :0 Min. :0
## 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 0.000 Median :0 Median :0
## Mean : -3.126 Mean :0 Mean :0
## 3rd Qu.: 0.000 3rd Qu.:0 3rd Qu.:0
## Max. : 0.000 Max. :0 Max. :0
## LP_NonPrincipalRecoverypayments LP_CustomerPayments
## Min. :0 Min. : 75.86
## 1st Qu.:0 1st Qu.: 3524.39
## Median :0 Median : 5647.26
## Mean :0 Mean : 7518.97
## 3rd Qu.:0 3rd Qu.:10302.02
## Max. :0 Max. :37369.16
## [1] 219 88
## (-1.5e+04,-1e+03] (-1e+03,-100] (-100,-10] (-10,10]
## 1 55 67 37
## (10,100] (100,1e+03] (1e+03,1.5e+04]
## 36 16 7
## [1] 15001.029 7500.596 13000.128 7500.697 15000.000 4000.000 20000.000
## [8] 4000.272 9499.755 20002.743 6500.558 11001.395 15000.562 8000.753
## [15] 15001.385 5000.000 8999.776 15001.258 5000.000 7500.000 5999.767
## [22] 7499.662 7500.595 8500.716 6000.493 6000.000 10001.396 10000.000
## [29] 6999.886 15000.000 10000.000 7000.422 10000.000 4000.555 13500.618
## [36] 7000.236 15001.408 8000.460 7500.355 15000.000 11999.727 10000.000
## [43] 4499.900 13001.614 12001.139 4000.679 15001.415 14500.891 21002.294
## [50] 13500.626 15000.502 7500.312 15001.623 8400.686 15001.011 3499.939
## [57] 10001.245 15001.001 5000.702 15000.508 5500.199 8000.455 21000.420
## [64] 10001.369 4000.285 20000.000 4000.852 10000.000 15001.500 7000.501
## [71] 12001.144 5000.000 20001.419 15001.390 7000.000 5000.000 10000.000
## [78] 8501.127 17000.799 15000.577 15001.362 25001.376 25000.711 13000.446
## [85] 10000.000 7500.300 12501.183 7500.355 15001.350 4000.211 4000.776
## [92] 7500.764 10000.000 25002.697 9000.524 3000.000 8000.000 15001.971
## [99] 25001.197 4000.556 12001.400 15000.000 10001.312 15002.062 4000.508
## [106] 10000.000 19501.862 4001.023 3999.729 15001.410 15000.616 9000.910
## [113] 10000.000 10000.000 4000.713 15000.681 15000.000 6000.000 20002.703
## [120] 10500.766 4000.000 6000.242 7000.672 15000.000 15000.000 4000.273
## [127] 12001.260 2999.867 4000.816 15001.257 11000.991 5000.000 12501.444
## [134] 25000.000 8000.000 15000.657 12000.955 12000.252 25003.536 5000.000
## [141] 10000.000 12001.687 7000.470 9499.868 24502.038 15001.011 4000.422
## [148] 5500.288 10500.729 15001.054 10000.000 10001.053 5499.840 5400.046
## [155] 25001.817 7500.259 15001.370 25001.679 25000.000 15001.404 15001.818
## [162] 14501.481 7500.297 15002.137 5225.047 10001.402 25001.723 4000.000
## [169] 15002.058 17001.606 10000.000 15000.000 4000.549 4001.115 7500.554
## [176] 20001.158 20000.000 7500.609 15000.000 7500.346 12000.850 15002.136
## [183] 15001.423 15001.420 5000.700 4500.068 15000.542 3999.764 8000.200
## [190] 6000.000 7500.355 15000.650 15001.395 25001.341 4000.000 10000.000
## [197] 15000.534 4000.211 12500.972 5000.000 10000.000 15002.099 10001.133
## [204] 7500.295 11501.721 3499.770 10000.000 13500.434 7999.782 10001.365
## [211] 4000.411 9600.100 4000.516 21002.607 4000.281 3000.123 4000.839
## [218] 15999.785 9000.659 12000.247 17500.552 15001.412 15500.418 25002.857
## [225] 11500.922 15002.114 13001.555 24502.195 15001.357 15002.078 3999.528
## [232] 15001.203 10499.930 15000.695 4000.000 5000.551 24000.000 10001.367
## [239] 12500.000 15001.249 4000.000 7000.000 15000.710 15000.000 5000.534
## [246] 10000.000 5500.779 8000.891 13001.470 17930.302 8500.250 10000.000
## [253] 6500.265 15001.353 15000.526 15002.128 25000.554 25002.125 11999.737
## [1] 0 88
I first want to look only at loans in status ‘completed’ to get a feel for the features LP_CustomerPayments, LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees. As these loans are in status ‘completed’ all principal, interest and fees should be payed. In the summary for this subset we see no loss or recovery payments, but some collection fees. So in between payment was late, but in the end everything was repayed. In the majority of the cases (19279 out of 19498) LoanOriginalAmount and LP_CustomerPrincipalPayments are equal (difference is smaller +/- 1). But there are a few cases (219) with partially considerable discrepancies of hundreds or tousands that were payed back too much or too few. When looking at these cases some were payed back after only a few months, but had a term of 3 or even 5 years. Maybe some amounts are not correctly accounted for? With the data available in this dataset we cannot answer this question, as we have no info on the single payments, only aggregated values. The percent funded is already incorporated in LoanOriginalAmount. This amount is not the amount originally requested by the borrower, but only the amount that was actually funded. LP_CustomerPayments = LP_CustomerPrincipalPayments + LP_InterestandFees holds for all loans, no outliers here.
## BorrowerState Occupation EmploymentStatus
## CA :10731 Other :21271 Employed :67309
## NY : 5839 Professional :10523 Full-time : 7758
## TX : 5627 Executive : 3464 Self-employed: 4532
## FL : 5401 Computer Programmer: 3220 Other : 3806
## IL : 4248 Teacher : 2882 Not employed : 647
## OH : 3367 Analyst : 2729 Retired : 364
## (Other):49455 (Other) :40579 (Other) : 252
## EmploymentStatusDuration IsBorrowerHomeowner DebtToIncomeRatio
## Min. : 0.0 False:39922 Min. : 0.000
## 1st Qu.: 30.0 True :44746 1st Qu.: 0.150
## Median : 74.0 Median : 0.220
## Mean :103.1 Mean : 0.259
## 3rd Qu.:148.0 3rd Qu.: 0.320
## Max. :755.0 Max. :10.010
## NA's :19 NA's :7279
## IncomeRange IncomeVerifiable StatedMonthlyIncome
## $50,000-74,999:25561 False: 7318 Min. : 0
## $25,000-49,999:24117 True :77350 1st Qu.: 3442
## $100,000+ :15189 Median : 5000
## $75,000-99,999:14468 Mean : 5934
## $1-24,999 : 4644 3rd Qu.: 7083
## Not employed : 647 Max. :1750003
## (Other) : 42
## [1] 688
## [1] 0
## [1] 416666.7 483333.3 108333.3 150000.0 618547.8 1750002.9 466666.7
## [8] 120833.3 158333.3 250000.0 108750.0 394400.0
## Source: local data frame [1 x 1]
##
## n
## (int)
## 1 12185
The majority of the borrowers comes from California, followed by NY and Texas. Most of them have some kind of employment, only few indicated to be unemployed or retired (647 and 364). More than half of the borrowers are homeowners. The median stated monthly income is 5000 $. There are some outliers with monthly incomes above 100000 $. If we cut off the highest 1% of incomes we get a slightly positively skewed distribution of incomes. The stated monthly incomes are consistent with the associated income ranges. When breaking out by loan status we see that the income distributions for each loan status are pretty similar to the overall income distribution, so income does not seem to be a strong indicator for whether a loan is payed back in time or not. Note that 12185 borrowers have or had multiple loans in the time between 2009 and 2014.
## LenderYield EstimatedEffectiveYield EstimatedLoss
## Min. :0.0300 Min. :-0.1827 Min. :0.00490
## 1st Qu.:0.1259 1st Qu.: 0.1166 1st Qu.:0.04240
## Median :0.1775 Median : 0.1615 Median :0.07240
## Mean :0.1861 Mean : 0.1688 Mean :0.08034
## 3rd Qu.:0.2474 3rd Qu.: 0.2243 3rd Qu.:0.11200
## Max. :0.3400 Max. : 0.3199 Max. :0.36600
##
## EstimatedReturn ProsperScore ProsperRating BorrowerAPR
## Min. :-0.18270 4 :12594 AA: 5338 Min. :0.04935
## 1st Qu.: 0.07455 6 :12271 A :14509 1st Qu.:0.16361
## Median : 0.09211 8 :12007 B :15566 Median :0.21945
## Mean : 0.09609 7 :10581 C :18303 Mean :0.22674
## 3rd Qu.: 0.11660 5 : 9806 D :14248 3rd Qu.:0.29254
## Max. : 0.28370 3 : 7639 E : 9781 Max. :0.42395
## (Other):19770 HR: 6923
## BorrowerRate
## Min. :0.0400
## 1st Qu.:0.1359
## Median :0.1875
## Mean :0.1961
## 3rd Qu.:0.2574
## Max. :0.3600
##
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-589.95
## 1st Qu.: 392.4 1st Qu.: 254.08 1st Qu.: -72.33
## Median : 1247.3 Median : 674.95 Median : -34.90
## Mean : 2624.3 Mean : 1042.86 Mean : -54.65
## 3rd Qu.: 3400.0 3rd Qu.: 1436.66 3rd Qu.: -14.28
## Max. :35000.0 Max. :10572.78 Max. : 3.01
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-4865.080 Min. : -94.2 Min. : -504.4
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.000 Median : 0.0 Median : 0.0
## Mean : -8.198 Mean : 376.2 Mean : 367.9
## 3rd Qu.: 0.000 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.000 Max. :25000.0 Max. :25000.0
## LP_NonPrincipalRecoverypayments LP_CustomerPayments
## Min. : 0.000 Min. : -2.35
## 1st Qu.: 0.000 1st Qu.: 797.52
## Median : 0.000 Median : 2213.68
## Mean : 7.639 Mean : 3667.20
## 3rd Qu.: 0.000 3rd Qu.: 4868.19
## Max. :7780.030 Max. :37369.16
## [1] 84628 88
##
## Pearson's product-moment correlation
##
## data: as.numeric(loanData$ProsperRating) and loanData$EstimatedLoss
## t = 1061.6, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9639573 0.9648986
## sample estimates:
## cor
## 0.964431
##
## Pearson's product-moment correlation
##
## data: as.numeric(loanData$ProsperScore) and loanData$EstimatedLoss
## t = -265.38, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6775265 -0.6701722
## sample estimates:
## cor
## -0.673866
##
## Pearson's product-moment correlation
##
## data: as.numeric(loanData$CreditScoreRangeUpper) and loanData$EstimatedLoss
## t = -172.82, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5156107 -0.5056518
## sample estimates:
## cor
## -0.5106484
The BorrowerRate is the LenderYield plus 1% (probably service fee). Lender Yield is in the range from 3 to 34%. In 2009-11 there are loans over the full range, while in the year 2012-14 the range reduces more and more (in 2014 it is between 5% and 30%). In 2011 and 2012 we have high peaks at around 30% while in 2013/14 the distribution of lender yield becomes more balanced. Furthermore we see that the lender yield becomes more and more aligned with the Prosper Rating over the years. Prosper uses an agency score and calcualates a ProsperScore. In addition there is a ProsperRating giving Prosper’s risk assessment (risk of loss) for the respective loan. And obviously when plotting the Prosper Rating against the Estimated Loss we find a clear correlation. ProsperScore and ProsperRating seem to be correlated, as well as the ProsperRating and the agency score. When looking it up in the docu you find that the ProsperRating takes into acount the ProsperRating and the agency score together with some additional parameters. What is intersting is that in 2014 it looks like Prosper increased the minimum agency score that a borrower must have to be allowed to apply for a Prosper loan. Might be that this is also related to the more balanced distribution of lender yield.
## 12 36 60
## 1613 58510 24545
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 8.00 13.00 16.07 23.00 41.00 58848
##
## FALSE TRUE
## 2875 16623
When we look at the closed date then we see that the number of closed loans increased over time, especially in the last 2 years. Maybe some of the many newly originated loans were payed back early? So the high number of new loans also leads to an increase in loan completion?
Looks like many loans are payed back early. Loan periods are 12, 36 and 60 months. Most common is a term of 36 months, while 12 months is pretty rare. When we look at the actual difference between LoanOriginationDate and ClosedDate (actualTerm in the graphs) we see 2 peaks for the completed loans, one around 12 months and one at about 36 months, as expected. But especially in the first 12 months there is a high number of completions and also between 12 and 36.
When calculating the ratio of the indicated term and the actual difference between origination and close date we see that the majority is repayed a few or even many months before its term ends (22791 out of 25820). And that seems to be true for older as well as for newer loans.
##
## (0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] (35,40] (40,45]
## 3765 5489 5513 3473 2404 1672 1238 2244 22
## (45,50] (50,55] (55,60]
## 0 0 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1493 513 1459 2219 3178 17440
##
## (999,4e+03] (4e+03,5.25e+03] (5.25e+03,1e+04] (1e+04,1.5e+04]
## 28072 5804 24039 17844
## (1.5e+04,3.5e+04]
## 8909
The distribution of the interest difference (expected (ideal) amount of interest - actual amount of interest) has a very long tail. Log-transforming it gives a more normal distribution, however with a strong tail to left. Note, that for log-transforming it I left out the negative values (where people payed more than the expected interest amount). Coloring the distribution of the log-transformed interest difference shows nicely how the the interest difference increases with shorter repayment terms and longer agreed loan term. The loan amount also seems to be a factor, on the other hand for the Prosper Rating there is not a clear picture.
The dataset contains information about 84668 loans that are processed through the Prosper platform. The dataset containd 87 varables for each loan reaching from on - the loan itself (amount, when it was originated, how many investors,…) - the status of the loan (aggregated info on customer payments or loss, payed interest, fees and principal) - the borrower (income, emloyment status/duration, state,…) - credit score and other indicators on credit-worthiness provided by an agency - on credit-worthimess based on previous prosper loans - and a set of figures provided by prosper to rate the given loan (prosper score, lender yield, estimated loss, estimated return)
Most of the loans are spent on home improvement. The amount ranges from 1000 to 35000$ (median = 7500$ and 500\(, 1000\) and 15000$ being the most common amounts). Since 2009 the average amount that is borrowed increased from around 4000$ to around 10000$ in 2013/2014. We also see a tremendous increase of loans being funded through the Prosper platform in the last year (reference time is about march 2014).
By far the most loans have one single investor (27440), but some have up to 1189 investors (the median is 68 investors). It looks like quite a number investors decide to invest in specific loans (providing the full amount) instead of using the tools like QuickInvest provided by the plattform.
The majority of the borrowers comes from California, followed by NY and Texas. Most of them have some kind of employment, only few indicated to be unemployed or retired (647 and 364). more than half of the borrowers are homeowners. The median stated monthly income is 5000$. There are some outliers with monthly incomes above 100000 $. If we cut off the highest 1% of incomes we get a slightly positively skewed distribution of incomes.
The main feature of interest is the interest rate the lender can achieve with the different loans. In the dataset we have different variants of this feature: - borrower rate (interest rate of the borrower) - lender yield (interest rate - 1% service fee) - estimated effective yield (borrower interest rate - servicing fee rate - estimated uncollected interest on charge-offs + estimated collected late fees)
I want to look at the lender yield. This is what the lender receives in an ideal case where the borrower pays back the full principal and interest. While the effective yield is equal to the lender yield minus estimated uncollected interest on charge-offs plus estimated collected late fees.
In addition the LP values give us summary information of what the borrower actually payed attributed to principal, interest, fees (LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees). From this we can calculate the amount of interest that the lender actually gets and compare to what is the expected amount of interest in the ideal case.
Time information like LoanOriginationDate/Quarter and ClosedDate can help to see how things evolved over time. Furthrmore information on the loan itself like the amount and the term for payback seems to be of interest when determining the lender yield. And certainly an indicator for the risk that is associated with the respective loan. Here the propser rating alias estimated loss is the indicator that I want to look at. It is derived from the Prosper score and the credit score and is used by Prosper to give an indicator for risk (estimated loss) associated with a loan.
I goggled how to calculate the the monthly payment (annuity): A = Kq^nT (q-1)/(q^nT - 1) with q = i/n where K is the loan amount, T is the term (here 3 or 5 years), n number of payments per year (in our case is 12 as we have a monthly payment) and i is interest rate per year. This way the monthly payment of the borrower is calculated. In addition I calculated the interest amount that the lender could expect in the best case where the borrower pays back all the principal and interest taking the term as indicated in the listing. However, as the borrower can pay back parts or the whole loan early whenever he can afford it, it can be interesting to look at the actual interest that was payed for completed loans and compare it to the best case interst. So in addition I calculated the difference between actual and best case interest payment. In addition I created the variable actualTerm which specifies how many months it took till the borrower actually payed back his loan (LoanOriginationDate - ClosedDate)
I converted date variables like LoanOriginationDate, CloseDate to the R Date type. I parsed the LoanOriginationQuarter parameter to be able to sort it first by year and then by quarter. Furthermore I converted some of the variables, like Term or listingCategory to factors. When I calculated the difference between CloseDate and LoanOriginationDate I found a few cases where the CloseDate < LoanOriginationDate. As this does not make sense I dropped these loans.
I found that the interest difference (expected (ideal) amount of interest - actual amount of interest) has a very long tail,so I tried a log transformation which leads to a more normal distribution of the data with a mean around 1000.
When looking at the distribution of lender yield over time, we already found that the distribution becomes more balanced and more aligned to the Propser Rating. The boxplot by loan origination quarter also show how the inter-quantile range becomes smaller. Furthermore the mean lender yield decreases. When looking at the violin plot we see that the bulk of the lender yields for the loans moves more to the mean starting in 2013. Prosper loans seem to move away from the high risk loans with high lender yields.
Things seem to be pretty stable over the quarters of a year, so breaking out by year should be enough.
When looking at the lender yield vs the Estimated Loss we see a very high variance in the lender yield. This proabably due to changes in how lender yields were assigned to different risk categories over the years? But breaking out by loan origination quarter shows that till 2010 there was a high variance in how the lender yield was aligned with the associcated risk, but then we get a more and more linear relation with almost no variance in 2014.
In 2012/13 we have 3 separate ‘lines’, wondering were this comes from. It seems to be agreed on term for the loan that also influenced the lender yield (the shorter the repayment term the lower the interest a borrower has to pay). However, in 2014 this effect cannot be seen any more, so Prosper does not seem to give better interest rates for shorter loan terms, the 12-months term goes away completely in 2014.
The loan amount also seems to determine the lender yield in a way. But this seems to be more related to the Prosper Score of the borrower. With lower a Prosper Score a borrower can only apply for lower amounts. At least this pattern emerges starting in 2011 to be become pretty clear in 2014.
When looking at the lender yield vs the Estimated Loss we see a very high variance in the lender yield. Breaking out by loan origination quarter shows that till 2010 there was a high variance in how the lender yield was aligned with the associcated risk, but then we get a more and more linear relation with almost no variance in 2014. So Prosper obviously adjusted its policy for assigning interest rates to loans according to the associated risk. While the agreed on loan term was still a factor that determined lender yield in 2012/13, this is no more the case in 2014, so Prosper seems to have decided to not give better interest rates for shorter loan terms any more.
This fits with the observation that I made with regard to the actual term/time that borrowers need to repay their loans. The majority of the loans are payed back early anyway (16623 out of 19498 completed loans) and no borrower seems to fully use their 5-year term, most do not even use their 3-year term as we have seen earlier.
When looking at the relation between interest difference (expected amount of interest (in the ideal case) - actual amount of interest) and actual repayment period we find that with shorter repayment period (actual term) the interest difference increases, but also its variance. Holding the actual term constant we see the highest interest difference for loans with a 5-year term and the lowest for loans with a 1-year term. Furthermore, holding actual term constant we also see higher interest difference for higher loan amounts again with increasing variance.
The relationship between Prosper Score and loan amount. Plotting the loan amount vs. the Estimated Loss and coloring by Prosper Score and then breaking out by origination year shows clearly how Prosper assigned different maximum loan amounts to people with different Propser Scores and how that evolved over time. In 2009 there seems to be no gradation, but till 2014 more and more clear steps can be seen. Starting in 2013 Propser increased its overall maximum loan to 35000 (25000 previously), but only for Prosper Score 7 or better. The association of lower maximum loan amounts with worse Propser Scores and higher amounts for better Prosper Scores respectively, is proabably one of the measures taken that led to the more and more balanced (normal) distribution of lender yields that we observed previously.
The relationship between lender yield and estimated loss in the years 2012 - 2014.
##
## Calls:
## m1: lm(formula = LenderYield ~ EstimatedLoss, data = subset(loanData,
## OriginationYear > 2010))
## m2: lm(formula = LenderYield ~ EstimatedLoss + Term_f, data = subset(loanData,
## OriginationYear > 2010))
##
## =======================================
## m1 m2
## ---------------------------------------
## (Intercept) 0.059*** 0.021***
## (0.000) (0.000)
## EstimatedLoss 1.588*** 1.606***
## (0.001) (0.001)
## Term_f: 36/12 0.032***
## (0.000)
## Term_f: 60/12 0.047***
## (0.000)
## ---------------------------------------
## R-squared 0.938 0.953
## adj. R-squared 0.938 0.953
## sigma 0.018 0.016
## F 1164623.865 516716.727
## p 0.000 0.000
## Log-likelihood 201011.893 211461.613
## Deviance 24.941 19.032
## AIC -402017.786 -422913.226
## BIC -401990.020 -422866.949
## N 77297 77297
## =======================================
##
## Calls:
## m1: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~
## ProsperRating - 1, data = sub1)
## m2: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~
## ProsperRating + Term_f - 1, data = sub1)
## m3: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~
## ProsperRating + Term_f + actualTerm - 1, data = sub1)
##
## ===================================================
## m1 m2 m3
## ---------------------------------------------------
## ProsperRating: AA 1.323*** -0.049 0.342***
## (0.032) (0.032) (0.012)
## ProsperRating: A 1.663*** 0.129*** 0.585***
## (0.023) (0.029) (0.011)
## ProsperRating: B 2.185*** 0.453*** 0.888***
## (0.023) (0.030) (0.011)
## ProsperRating: C 2.550*** 0.741*** 1.145***
## (0.022) (0.030) (0.011)
## ProsperRating: D 2.433*** 0.877*** 1.386***
## (0.018) (0.029) (0.011)
## ProsperRating: E 2.559*** 1.119*** 1.569***
## (0.024) (0.030) (0.011)
## ProsperRating: HR 2.473*** 1.215*** 1.628***
## (0.024) (0.031) (0.012)
## Term_f: 36/12 1.258*** 1.867***
## (0.028) (0.011)
## Term_f: 60/12 3.430*** 3.765***
## (0.030) (0.011)
## actualTerm -0.080***
## (0.000)
## ---------------------------------------------------
## R-squared 0.820 0.930 0.990
## adj. R-squared 0.820 0.930 0.990
## sigma 1.069 0.665 0.250
## F 9900.795 22535.438 153300.805
## p 0.000 0.000 0.000
## Log-likelihood -22623.597 -15405.452 -475.269
## Deviance 17390.556 6741.139 949.317
## AIC 45263.193 30830.904 972.538
## BIC 45324.243 30907.216 1056.482
## N 15233 15233 15233
## ===================================================
I found that looking at the interest difference relative to the Monthly Payment for the loan reduces the variance tremendously which makes sense as the interest difference depends on the loan’s amount which is related to the Monthly Rate the borrower has to pay. I looked at difference in interest amount per Monthly Payment vs. actual repayment period, as I expect the actual payment period to be a determining factor for the interest difference. Holding the actual repayment period constant we always get a higher interest difference for a longer agreed on term (1-year vs 3-year vs 5-year). This is proabably because the expected amount of interest depends on the agreed on term.
Holding the actual repayment period constant we get 3 groups. The 3 groups seem to perfectly align with the agreed on terms. That means holding the actual repayment period and the agreed on term constant we always have a a higher interest difference for worse propser rating. Maybe because of the higher interest rate the borrower has to pay when getting a worse rating. I first log-transformed the interest difference per Monthly Payment, but then found using a sqrt-transformation shows a more linear relationship. So the last 2 plots suggest that we can use the feature variables to build a linear model for interest difference.
I tried buildng a linear model for Lender Yield and for interest difference. The model for Lender Yield is pretty simple I just used the features Estimated Loss and Term and a get a model that according to the r squared accounts for ~95% of the variation which is pretty good. When when looking at the residuals these are are reasonably well normally distributed in the Lender Yield range from 10-30%, but not outside this range as also the normal QQ-plot shows.
Regarding the model for interest difference, I looked at interest difference per Monthly Payment and used a sqrt-transformation. As features I used Term, Prosper Rating and actual repayment period. The r square suggests that Prosper Rating and Term account for 93 % of the variation. Adding actual repayment period increases this to 95%. But I was surprised that actual repayment period only added so “little” or better but that Prosper Rating, Term and Monthly Payment explain so much of the variance. This might be because the actual repayment period might already depend on the other features like Propsper Rating or Monthly Payment. When we look at the ggpairs plot I did previously we can see that it is correlated with the loan’s original amount (-0.114) and with Lender Yield (-0.412). And we know from previous plots that there is a strong correlation between Lender Yield, Estimated Loss and Prosper Rating and it is also obvious that Monthly Payment and the loan’s original amount are not independant. So we should leave the actual repayment period out. Furthermore this would make the model more valuable as this feature is not in the dataset and you don’t know the actual repayment period in advance. So it would be possible to predict the interest difference only with available features.
However, looking at the residuals these are not normally distributed as the QQ plot suggests. So further investigation is necessary to find additional features that lead to the systematic error.
——
The plot shows how Prosper aligs the Lender Yield and the risk associated with a loan more stricly over the years. As an example in the 2009-2011 timeframe there exist loans rated as ‘High-Risk’ which were assigned a Lender Yield of as low as 16.8% and up to 34% (median was already 30%), while in 2012-2014 the range reduced to only 27.7-30.7%. This is also reflected in the correlation. In the years 2009-2011 lender there already exists a strong correlation between Estimated Loss and Lender Yield (0.89), however it becomes almost perfect in the years 2012-2014 (0.97). Furthermore one can oberserve how the maximum amount that a borrower can apply for is more and more adjusted to the loan’s associated risk. While in 2009-2010 there is no clear gradation, in 2011 only loans with a Prosper Rating of AA or A have a loan amount of up to $25000. In 2012 this is extended to the B and C rated loans. Interestingly, in 2013 the maximum amount was increased to $35000 but only for AA - B rated loans, while C rated loans stayed at a maximum of $25000, Also the lower ratings have a specific maximum amount (D rated loans up to $15000, E rated loans 10000$ and HR rated only $4000).
The plot shows how lender yield and estimated loss (the risk associated with a loan) become more aligned over the years. In 2009 we see a great variance in the lender yield which even increases for loans with a higher estimated loss. The standard deviation grouped by Prosper Rating ranges from 2.3% for AA rated loans to up to 6.5% for HR rated loans. The variance in the lender yield reduces over the years, until we have an almost perfect linear relationship in 2014. In the years 2013/2014 the standard deviation of the lender yield grouped by Prosper Rating fluctuates only between 0.66% and 1.8%. So Prosper obviously adjusted its policy for assigning interest rates to loans according to the associated risk.
Furthermore we can see how the repayment term affects the lender yield that is assigned to a loan. Keeping the Estimated Loss constant we find that loans with shorter repayment term are assigned lower lender yields. This can be seen very clearly for the years 2011, 2012 and 2013. However, in 2014 the repayment term looses its effect on the lender yield. So Prosper seems to have decided to not give better interest rates for shorter loan terms any more. This fits with the observation that the majority of the loans are payed back early anyway (16623 out of 19498 completed loans).
This plot further investigates the effect that early repayment of a loan has on the amount of interest that a lender can expect to achieve. The plot shows the factors that primarily determine the interest loss a lender can expect not to achieve due to early repayment of a loan. We look at the the amount of interest relative to the loan’s amount as obviously the amount of interest depends on the borrowed amount. Further applying a Squareroot-transformation shows a negative linear relationship to the actual repayment period. Holding the actual repayment period constant we get 3 groups. The 3 groups seem to perfectly align with the agreed on terms. That means holding the actual repayment period and the agreed on term constant we always have a a higher interest difference for worse propser rating. Probably because of the higher interest rate the borrower has to pay when getting a worse rating.
Given that there are so many features in this dataset I started looking at features more or less randomly. As I have no background in finance I did not really know what to look at. In the beginning I struggled with understanding the meaning of many of the variables, but then found that plotting often helped in understanding their meaning and relation(in addition to reading documenation). The dataset contains many variables which are very closely related and to understand all the subtle differences between the different features was not always easy. For some the variables I found in the course of the investigation that another representation (e.g. transform to categorical variable) or reducing the number of categories helped (especially for time related variables, e.g. years instead of quarters/months) when plotting the data. Otherwise one cannot see anything on the plots as they are so overloaded with detail. Then I observed how the number of loans handled by the platform increased over time and that many loans are payed back early. Then I found that lender yield/borrower rate became more and more aligned with the associcated risk over the years and that at the same time that the distribution of the loans with regard to Propser Rating became more balanced, that means more normal. Previously one could observe many more high-risk loans than lower risk loans. So I looked at the factors that determine the Lender Yield more closely and it turned out that in the associated risk really is the determining factor. Furthermore I found it interesting to further investiagte how the early repayment affects the amount of interest that a lender can expect. Sure he gets his money back earlier and can reinvest it, but still it might be of interest as in some cases the money was payed back only after 1 to 3 months, which results in more effort for the lender when he has to take care of re-investing. The data contains loan data till the beginning of 2014, so it might be interesting to look at newer loan data to verify the findings. To further investiagte the interest difference it would be interesting to get more details on when which repayments occured, in this dataset we only get the aggregated amount of borrower payments.